Goto

Collaborating Authors

 critic agent


Towards Cognitive Synergy in LLM-Based Multi-Agent Systems: Integrating Theory of Mind and Critical Evaluation

Kostka, Adam, Chudziak, Jarosław A.

arXiv.org Artificial Intelligence

Recently, the field of Multi-Agent Systems (MAS) has gained popularity as researchers are trying to develop artificial intelligence capable of efficient collective reasoning. Agents based on Large Language Models (LLMs) perform well in isolated tasks, yet struggle with higher-order cognition required for adaptive collaboration. Human teams achieve synergy not only through knowledge sharing, but also through recursive reasoning, structured critique, and the ability to infer others' mental states. Current artificial systems lack these essential mechanisms, limiting their ability to engage in sophisticated collective reasoning. This work explores cognitive processes that enable effective collaboration, focusing on adaptive theory of mind (ToM) and systematic critical evaluation. We investigate three key questions. First, how does the ability to model others' perspectives enhance coordination and reduce redundant reasoning? Second, to what extent does structured critique improve reasoning quality by identifying logical gaps and mitigating biases? Third, the interplay of these mechanisms can lead to emergent cognitive synergy, where the collective intelligence of the system exceeds the sum of its parts. Through an empirical case study on complex decision making, we show that the integration of these cognitive mechanisms leads to more coherent, adaptive, and rigorous agent interactions. This article contributes to the field of cognitive science and AI research by presenting a structured framework that emulates human-like collaborative reasoning MAS. It highlights the significance of dynamic ToM and critical evaluation in advancing multi-agent systems' ability to tackle complex, real-world challenges.


VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

Smyth, Ethan, Suglia, Alessandro

arXiv.org Artificial Intelligence

Open-endedness is an active field of research in the pursuit of capable Artificial General Intelligence (AGI), allowing models to pursue tasks of their own choosing. Simultaneously, recent advancements in Large Language Models (LLMs) such as GPT-4o [9] have allowed such models to be capable of interpreting image inputs. Implementations such as OMNI-EPIC [4] have made use of such features, providing an LLM with pixel data of an agent's POV to parse the environment and allow it to solve tasks. This paper proposes that providing these visual inputs to a model gives it greater ability to interpret spatial environments, and as such, can increase the number of tasks it can successfully perform, extending its open-ended potential. To this aim, this paper proposes VoyagerVision -- a multi-modal model capable of creating structures within Minecraft using screenshots as a form of visual feedback, building on the foundation of Voyager. VoyagerVision was capable of creating an average of 2.75 unique structures within fifty iterations of the system, as Voyager was incapable of this, it is an extension in an entirely new direction. Additionally, in a set of building unit tests VoyagerVision was successful in half of all attempts in flat worlds, with most failures arising in more complex structures. Project website is available at https://esmyth-dev.github.io/VoyagerVision.github.io/


Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming

Gong, Nanxu, Wang, Xinyuan, Ying, Wangyang, Bai, Haoyue, Dong, Sixun, Chen, Haifeng, Fu, Yanjie

arXiv.org Artificial Intelligence

Feature transformation involves generating a new set of features from the original dataset to enhance the data's utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and lengthy. It highly necessitates transforming feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing methods fall short in efficient navigation of a vast space of feature combinations, and are mostly designed for supervised settings. To fill this gap, our unique perspective is to leverage a generator-critic duet-play teaming framework using LLM agents and in-context learning to derive pseudo-supervision from unsupervised data. The framework consists of three interconnected steps: (1) Critic agent diagnoses data to generate actionable advice, (2) Generator agent produces tokenized feature transformations guided by the critic's advice, and (3) Iterative refinement ensures continuous improvement through feedback between agents. The generator-critic framework can be generalized to human-agent collaborative generation, by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets. Our code is publicly available at https://github.com/NanxuGong/LPFG.


MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving

Zhang, Jian, Wang, Zhiyuan, Wang, Zhangqi, Zhang, Xinyu, Xu, Fangzhi, Lin, Qika, Mao, Rui, Cambria, Erik, Liu, Jun

arXiv.org Artificial Intelligence

Multimodal scientific problems (MSPs) involve complex issues that require the integration of multiple modalities, such as text and diagrams, presenting a significant challenge in artificial intelligence. While progress has been made in addressing traditional scientific problems, MSPs still face two primary issues: the challenge of multi-modal comprehensive reasoning in scientific problem-solving and the lack of reflective and rethinking capabilities. To address these issues, we introduce a Multi-Agent framework based on the Big Seven Personality and Socratic guidance (MAPS). This framework employs seven distinct agents that leverage feedback mechanisms and the Socratic method to guide the resolution of MSPs. To tackle the first issue, we propose a progressive four-agent solving strategy, where each agent focuses on a specific stage of the problem-solving process. For the second issue, we introduce a Critic agent, inspired by Socratic questioning, which prompts critical thinking and stimulates autonomous learning. We conduct extensive experiments on the EMMA, Olympiad, and MathVista datasets, achieving promising results that outperform the current SOTA model by 15.84% across all tasks. Meanwhile, the additional analytical experiments also verify the model's progress as well as generalization ability.


Saarthi: The First AI Formal Verification Engineer

Kumar, Aman, Gadde, Deepak Narayan, Radhakrishna, Keerthan Kopparam, Lettnin, Djones

arXiv.org Artificial Intelligence

Recently, Devin has made a significant buzz in the Artificial Intelligence (AI) community as the world's first fully autonomous AI software engineer, capable of independently developing software code [1] [2]. Devin uses the concept of agentic workflow in Generative AI (GenAI), which empowers AI agents to engage in a more dynamic, iterative, and self-reflective process. With Saarthi, verification engineers can focus on more complex problems, and verification teams can strive for more ambitious goals. The domain-agnostic implementation of Saarthi makes it scalable for use across various domains such as RTL design, UVM-based verification, and others. Hardware design verification, especially formal verification, entails a methodical and disciplined approach to the planning, development, execution, and sign-off of functionally correct hardware designs. Formal verification uses mathematical methods to prove the correctness of hardware designs against their specifications, ensuring that all possible states and inputs are considered, which complements traditional simulation-based verification techniques that might only cover a subset of possible scenarios due to practical constraints [3]. The formal verification process encompasses several key roles, such as organizational coordination, task allocation, code development, property proving, analyzing Counter Examples (CEXs), debugging, coverage closure, and documentation preparation. These roles are crucial for managing the complexity and ensuring the thoroughness of the verification process. For instance, analyzing counterexamples involves identifying specific scenarios where the design might fail to meet its specifications, which is critical for debugging and refining the design. This highly intricate activity demands meticulous attention to detail, given its long development cycles and the critical nature of ensuring hardware functionality and reliability [4]. The field of Natural Language Processing (NLP) has undergone a significant transformation with the advent of Large Language Models (LLMs) [5].


Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation

Li, Yuelei, Yan, Ge, Macaluso, Annabella, Ji, Mazeyu, Zou, Xueyan, Wang, Xiaolong

arXiv.org Artificial Intelligence

The recent advancements in visual reasoning capabilities of large multimodal models (LMMs) and the semantic enrichment of 3D feature fields have expanded the horizons of robotic capabilities. These developments hold significant potential for bridging the gap between high-level reasoning from LMMs and low-level control policies utilizing 3D feature fields. In this work, we introduce LMM-3DP, a framework that can integrate LMM planners and 3D skill Policies. Our approach consists of three key perspectives: high-level planning, low-level control, and effective integration. For high-level planning, LMM-3DP supports dynamic scene understanding for environment disturbances, a critic agent with self-feedback, history policy memorization, and reattempts after failures. For low-level control, LMM-3DP utilizes a semantic-aware 3D feature field for accurate manipulation. In aligning high-level and low-level control for robot actions, language embeddings representing the high-level policy are jointly attended with the 3D feature field in the 3D transformer for seamless integration. We extensively evaluate our approach across multiple skills and long-horizon tasks in a real-world kitchen environment. Our results show a significant 1.45x success rate increase in low-level control and an approximate 1.5x improvement in high-level planning accuracy compared to LLM-based baselines. Demo videos and an overview of LMM-3DP are available at https://lmm-3dp-release.github.io.


Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Subramaniam, Vighnesh, Du, Yilun, Tenenbaum, Joshua B., Torralba, Antonio, Li, Shuang, Mordatch, Igor

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.


CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

Li, Jierui, Le, Hung, Zhou, Yingbo, Xiong, Caiming, Savarese, Silvio, Sahoo, Doyen

arXiv.org Artificial Intelligence

Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can act as agents with capabilities to self-refine and improve generated code autonomously. However, on challenging coding tasks with extremely large search space, current agentic approaches still struggle with multi-stage planning, generating, and debugging. To address this problem, we propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process. Specifically, we adopted a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine the solutions. In each stage, critical decision-making (ranking, termination, expanding) of the exploration process is guided by both the environmental execution-based feedback and LLM-agent-generated feedback. We comprehensively evaluated CodeTree on 7 code generation benchmarks and demonstrated the significant performance gains of CodeTree against strong baselines. Using GPT-4o as the base model, we consistently achieved top results of 95.1 on HumanEval, 98.7 on MBPP, and 43.0 on CodeContests. On the challenging SWEBench benchmark, our approach led to significant performance gains.


Enhancing Financial Question Answering with a Multi-Agent Reflection Framework

Fatemi, Sorouralsadat, Hu, Yuheng

arXiv.org Artificial Intelligence

While Large Language Models (LLMs) have shown impressive capabilities in numerous Natural Language Processing (NLP) tasks, they still struggle with financial question answering (QA), particularly when numerical reasoning is required. Recently, LLM-based multi-agent frameworks have demonstrated remarkable effectiveness in multi-step reasoning, which is crucial for financial QA tasks as it involves extracting relevant information from tables and text and then performing numerical reasoning on the extracted data to infer answers. In this study, we propose a multi-agent framework incorporating a critic agent that reflects on the reasoning steps and final answers for each question. Additionally, we enhance our system by adding multiple critic agents, each focusing on a specific aspect of the answer. Our results indicate that this framework significantly improves performance compared to single-agent reasoning, with an average performance increase of 15% for the LLaMA3-8B model and 5% for the LLaMA3-70B model. Furthermore, our framework performs on par with, and in some cases surpasses, larger single-agent LLMs such as LLaMA3.1-405B and GPT-4o-mini, though it falls slightly short compared to Claude-3.5 Sonnet. Overall, our framework presents an effective solution to enhance open-source LLMs for financial QA tasks, offering a cost-effective alternative to larger models like Claude-3.5 Sonnet.


Athena: Safe Autonomous Agents with Verbal Contrastive Learning

Sadhu, Tanmana, Pesaranghader, Ali, Chen, Yanan, Yi, Dong Hoon

arXiv.org Artificial Intelligence

Due to emergent capabilities, large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks and make decisions with an increasing degree of autonomy. These autonomous agents can understand high-level instructions, interact with their environments, and execute complex tasks using a selection of tools available to them. As the capabilities of the agents expand, ensuring their safety and trustworthiness becomes more imperative. In this study, we introduce the Athena framework which leverages the concept of verbal contrastive learning where past safe and unsafe trajectories are used as in-context (contrastive) examples to guide the agent towards safety while fulfilling a given task. The framework also incorporates a critiquing mechanism to guide the agent to prevent risky actions at every step. Furthermore, due to the lack of existing benchmarks on the safety reasoning ability of LLM-based agents, we curate a set of 80 toolkits across 8 categories with 180 scenarios to provide a safety evaluation benchmark. Our experimental evaluation, with both closed- and open-source LLMs, indicates verbal contrastive learning and interaction-level critiquing improve the safety rate significantly.